Steps that you should take before a problem occurs:
Monitoring systems
Baselining systems
Managing multiple versions of configuration files
Writing a disaster recovery plan
Being Proactive
Support Contracts
Support contracts for critical systems are essential.
Most large software and hardware vendors offer a range of support options.
Before you purchase a support contract, check what coverage you will actually get.
If support contracts are not enough, you may also need to keep spare hardware on-site.
If possible, configure spare hardware with automatic failover to minimize downtime.
This arrangement is commonly called a "hot spare" (or hot standby).
You may also want to keep a spare on-site for cold swap components.
Track warranties on hardware, because doing so may help you obtain spares quickly and efficiently.
Replace disks that are out of warranty.
Avoid using disks that are out of warranty for critical data.
Have a replacement plan in place, including funding and migration plans.
Know what to do when a warranty expires.
Documentation
Maintain documents that fully outline and identify the following for your organization:
Hardware
Software
Configuration settings for each component
Use man -k keyword to search the names and short descriptions of the installed man pages for a keyword:
[student@server1 ~]$ man -k passwd
checkPasswdAccess (3) - query the SELinux policy database in the kernel.
chpasswd (8) - update passwords in batch mode
ckpasswd (8) - nnrpd password authenticator
fgetpwent_r (3) - get passwd file entry reentrantly
getpwent_r (3) - get passwd file entry reentrantly
...
passwd (1) - update user's authentication tokens
sslpasswd (1ssl) - compute password hashes
passwd (5) - password file
passwd.nntp (5) - Passwords for connecting to remote NNTP servers
passwd2des (3) - RFS password encryption
...
Most other documentation is found in the /usr/share/doc/ directory, in subdirectories named by the RPM package.
If it is not a man page, not an info page, and not part of the GNOME help utility, it is stored here.
Many applications have their documentation packaged in a separate RPM package, which may or may not be installed. In Red Hat Enterprise Linux 7, these packages are often found in the Optional tree.
To locate the documentation supplied with an RPM package:
Use rpm -qd package to list all files flagged %doc.
Use rpm -qc package to list all configuration files distributed in the package.
References
man(1) and rpm(8) man pages
/usr/share/doc/packagename/
Monitoring: Centralized Logging
Information gathering is one of the most important phases of troubleshooting.
Log files, kernel output, and device output can all help you diagnose your system more quickly.
Knowing how to order and search output is essential in troubleshooting.
Commands such as grep, uniq, sort, and less are fundamental to finding errors and identifying problems.
If possible, compare logs and output with a similar healthy system to locate relevant error messages.
Once you locate the errors, you can fix the problem, and then test.
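As a minimal sketch of how grep, sort, and uniq combine, the following counts how often each distinct error message appears in a syslog-style file. The function name count_errors, the sample file, and the service names appd and backupd are hypothetical; on a real system you would point the function at a file such as /var/log/messages.

```shell
# Count how often each distinct error message occurs in a syslog-style file,
# most frequent first. The first four space-separated fields (month, day,
# time, host) are cut away so that repeated messages collapse into one line.
# Assumes two-digit days; "Jan  8" has an extra space and would shift fields.
count_errors() {
    grep -i 'error' "$1" | cut -d' ' -f5- | sort | uniq -c | sort -rn
}

# Hypothetical sample data; appd and backupd are made-up service names.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Jan 18 14:24:37 desktop1 appd: error: cannot open /srv/data
Jan 18 14:24:40 desktop1 appd: info: retrying
Jan 18 14:25:10 desktop1 appd: error: cannot open /srv/data
Jan 18 14:26:02 desktop1 backupd: error: target unreachable
EOF

count_errors "$sample"
rm -f "$sample"
```

Piping the result into less is convenient when the list is long. Running the same function against the log of a healthy system gives a quick basis for comparison.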
Good logging practices are prerequisites to effective troubleshooting.
Ensure that syslog is running and configured to log information from important services on all systems.
Increase the log level to aid with troubleshooting (for example, from info to debug).
Ensure that important messages are forwarded to a central log server, perhaps one that is proactively watching the events to notify you of pending failures.
Red Hat Enterprise Linux 7 uses rsyslog for event logging. rsyslog is an enhanced syslog daemon that supports both UDP and TCP transport, failover destinations, and queued operations.
/etc/rsyslog.conf contains numerous comments.
See /usr/share/doc/rsyslog-*/ for more information.
Configuring a Server to Accept Remote Log Messages Using UDP
Uncomment the following lines in /etc/rsyslog.conf:
$ModLoad imudp
$UDPServerRun 514
Restart the service:
[root@server1 ~]# systemctl restart rsyslog
Open the host firewall for inbound traffic on port 514/UDP (and 514/TCP, if TCP reception is also enabled).
Forwarding Messages via UDP to a Central Log Server
Decide on the types of messages (facility and priority) and the name or IP address of the central log server.
Add a line similar to the following to /etc/rsyslog.conf:
*.info @server1
Restart the service:
[root@desktop1 ~]# systemctl restart rsyslog
Test the forwarding rule with the logger command:
[root@desktop1 ~]# logger "Hello from desktop1"
[root@desktop1 ~]# tail /var/log/messages
Jan 18 14:24:37 desktop1 root: Hello from desktop1
[root@server1 ~]# tail /var/log/messages
Jan 18 14:24:37 desktop1 root: Hello from desktop1
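The forwarding rule above uses a single @, which selects UDP. rsyslog also accepts @@ to forward over TCP. The following sketch prints a rule for either transport; the helper name forward_rule is hypothetical, and the authpriv selector is only an illustration.

```shell
# Print an rsyslog forwarding rule for a selector, a log host, and a transport.
# In rsyslog's selector syntax a single @ forwards over UDP and @@ over TCP.
forward_rule() {
    case "${3:-udp}" in
        udp) printf '%s @%s\n' "$1" "$2" ;;
        tcp) printf '%s @@%s\n' "$1" "$2" ;;
        *)   echo "unknown transport: $3" >&2; return 1 ;;
    esac
}

forward_rule '*.info' server1            # the UDP rule from the example above
forward_rule 'authpriv.*' server1 tcp    # TCP variant for authentication messages
```

A TCP rule only works if the central server also accepts TCP. In the legacy directive style used in this section, that would mean enabling $ModLoad imtcp and $InputTCPServerRun 514 on the server, assuming its rsyslog.conf uses that style.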
Monitoring: Hard Drive Failures
Hard drives die. It is not a question of whether a drive will die, but when.
If you know that a drive is dying, you can plan for its replacement instead of responding to an emergency.
SMART = Self-Monitoring, Analysis and Reporting Technology
SMART is built in to almost all modern hard drives.
In Red Hat Enterprise Linux systems, the smartd daemon polls all of the hard drives every 30 minutes.
If smartd sees that a drive is dying, it issues a message to /var/log/messages and sends an email message to the root user on the local system.
You can specify an alternate, centralized email address in /etc/smartmontools/smartd.conf.
Another method of talking to a SMART-enabled drive is with the smartctl tool.
One method of using smartctl is to ask for only the overall health status:
[root@server1 ~]# smartctl -H /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-123.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
For more detailed information, query all the individual counters: smartctl -a /dev/sda. The column you are interested in is RAW_VALUE.
To tell the drive to perform a test immediately, use smartctl -t testtype /dev/sda, where testtype is either offline, long, or short.
To view the results of a self-test (long or short), run smartctl -l selftest /dev/sda.
To get the output of the offline test or the errors from any other test, run smartctl -l error /dev/sda.
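As a rough sketch of reading the RAW_VALUE column, the following filters attribute-table output such as smartctl -A prints for ATA drives and shows only watched attributes whose raw value is nonzero. The function name nonzero_raw, the choice of attribute IDs 5, 197, and 198, and the sample table are assumptions made for illustration. Drives that do not report an ATA attribute table are not covered.

```shell
# Read `smartctl -A` style output on stdin and print "NAME RAW_VALUE" for the
# watched attributes (IDs 5, 197, 198) whose raw value is greater than zero.
# In the ATA attribute table, field 1 is the ID, field 2 the name, and
# field 10 the raw value.
nonzero_raw() {
    awk '$1 ~ /^(5|197|198)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Illustrative sample table; the values are made up.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8012
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0'

printf '%s\n' "$sample" | nonzero_raw
```

On a live system the equivalent call would be smartctl -A /dev/sda | nonzero_raw, run as root.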
Reference
smartd(8), smartd.conf(5), and smartctl(8) man pages
Baselining: Using AIDE
Good baseline monitoring of systems is extremely helpful when troubleshooting. A baseline lets you:
Compare against a known good state when a system appears to be behaving erratically.
Report when a system is operating outside of specified parameters.
Tighten security.
Build trends for your systems and networks over time.
Use trends to spot events outside of the norm.
Deciding what to monitor depends on the work that a system does.
For database servers or file servers, disk space, service availability, and load might be important.
For a desktop system, you might just check to see that it is running.
Long-term monitoring can be used to:
Measure growth of system load over time
Predict when a new server or file store is required
Measure how improvements affect service availability, workflow, and numerous other issues
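One simple way to report when a system is operating outside of specified parameters is a disk space check. The sketch below reads df -P output and prints every mount point at or above a usage limit. The function name over_threshold and the 90 percent limit are assumptions chosen for illustration.

```shell
# Read `df -P` output on stdin and print each mount point whose usage is at
# or above the given percentage. Mount points containing spaces are not handled.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Example: report filesystems that are at least 90% full.
df -P | over_threshold 90
```

Recording the same output at regular intervals, for example from cron, is one way to build the long-term trend data described above.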
AIDE = Advanced Intrusion Detection Environment
AIDE is a tool to check the integrity of files on the system.
When the system is in a known good state, AIDE scans the system and collects information about installed files:
Checksums
Permissions
Other characteristics
Information is placed in a database file which can be stored offline.
Use AIDE to compare the state of the system against the stored database and check for any changes.
Steps to Deploy AIDE
The following is an example of deploying AIDE on server1, starting with the configuration in /etc/aide.conf:
@@define DBDIR /var/lib/aide (1)
@@define LOGDIR /var/log/aide
database=file:@@{DBDIR}/aide.db.gz (2)
database_out=file:@@{DBDIR}/aide.db.new.gz (3)
gzip_dbout=yes
report_url=file:@@{LOGDIR}/aide.log (4)
report_url=stdout
# R is short for p+i+n+u+g+s+m+c+acl+selinux+xattrs+md5
NORMAL = R+rmd160+sha256 (5)
PERMS = p+i+u+g+acl+selinux
/ NORMAL (6)
!/etc/.*~
/root/\..* PERMS
(1) Defines macros that can be used in /etc/aide.conf.
(2) Configuration directive defining the location of the AIDE database. Note that this example uses a macro defined above.
(3) Configuration directive defining the location in which aide --init will save a newly created database file.
(4) Where the results of aide --check will be reported. Note that multiple locations are allowed.
(5) Group definition line. For files selected in group NORMAL, AIDE stores their permissions, inode, number of links, user and group, size, mtime and ctime, POSIX ACLs, SELinux context, extended attributes, MD5 checksum, RMD160 checksum, and SHA256 checksum.
(6) Selection lines. The first one adds all files under / to be checked in group NORMAL; the second exempts all files in /etc that end in ~ from being checked; the third specifies that all files under /root that start with a period (.) should be checked in group PERMS only. Note that this uses regular expression syntax.
Run /usr/sbin/aide --init to build the initial database. This can take a while as it creates a gzipped database called /var/lib/aide/aide.db.new.gz.
[root@server1 ~]# aide --init
AIDE, version 0.15.1
### AIDE database at /var/lib/aide/aide.db.new.gz initialized.
Store /etc/aide.conf, /usr/sbin/aide, and /var/lib/aide/aide.db.new.gz in a secure location (not on this same system!). Alternatively, extract a signature of these files so they can be verified in the future.
Copy /var/lib/aide/aide.db.new.gz to /var/lib/aide/aide.db.gz (the expected name).
[root@server1 ~]# cd /var/lib/aide
[root@server1 aide]# cp aide.db.new.gz aide.db.gz
[root@server1 aide]# cd
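The deployment steps mention extracting a signature of the AIDE files so they can be verified later. One possible approach, sketched here, is to record SHA-256 checksums with sha256sum. The function names, the checksum file name aide.sha256, and the stand-in files in a scratch directory are all assumptions for illustration. On a real system the files would be /etc/aide.conf, /usr/sbin/aide, and /var/lib/aide/aide.db.new.gz, and the checksum file would be kept off the monitored host.

```shell
# Record SHA-256 checksums of the named files in DIR into DIR/aide.sha256.
record_signatures() {
    dir=$1; shift
    ( cd "$dir" && sha256sum "$@" > aide.sha256 )
}

# Verify the files in DIR against DIR/aide.sha256; nonzero status on mismatch.
verify_signatures() {
    ( cd "$1" && sha256sum -c aide.sha256 )
}

# Demonstration with stand-in files in a scratch directory.
work=$(mktemp -d)
echo 'stand-in config' > "$work/aide.conf"
echo 'stand-in database' > "$work/aide.db.new.gz"
record_signatures "$work" aide.conf aide.db.new.gz
verify_signatures "$work"
rm -rf "$work"
```

A plain checksum only detects changes if the checksum file itself is trustworthy, which is why it belongs on separate or read-only storage.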
Verifying System Integrity with AIDE
This next example demonstrates testing file integrity using aide.
Modify a file on your system so that it differs from the recorded baseline.
[root@server1 ~]# echo shiny new >> /bin/tcsh
Run /usr/sbin/aide --check to check your system for inconsistencies.
[root@server1 ~]# aide --check
AIDE 0.15.1 found differences between database and filesystem!!
Start timestamp: 2014-12-15 08:22:04
Summary:
Total number of files: 107530
Added files: 9
Removed files: 0
Changed files: 10
---------------------------------------------------
Added files:
---------------------------------------------------
... Output omitted ...
---------------------------------------------------
Changed files:
---------------------------------------------------
changed: /usr/bin/tcsh
... Output omitted ...
Results are displayed on standard output and in /var/log/aide/aide.log by default.
If you know about these changes, you can run aide --update to update your database and store it in a secure location again.
Network monitoring measures network activity and looks for slow or failing servers, routers, switches, or other devices.
There are active and passive monitoring techniques that may involve agents residing on the network equipment that notify or are polled by a network management system.
Many enterprises use network monitoring/management systems and services from CA, HP, IBM, and other vendors.
Nagios is a free open source monitoring tool.
Provided via the EPEL (Extra Packages for Enterprise Linux) repository from the Fedora project.
Not supported by Red Hat.
Modular system consisting of a core nagios package with additional functionality provided by plug-ins.
Plug-ins can run on local machines to provide information not readily available via the network.
Flexible configuration allows definitions of time periods, admin groups, system groups, and custom command sets.
Web-based interface on the main Nagios server allows configuring tests and settings for Nagios and the hosts it is monitoring.
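To give a feel for what a plug-in does, here is a minimal sketch of plug-in logic. It follows the usual Nagios convention of printing one status line and returning exit status 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The function name check_threshold and the threshold values are hypothetical; this is not one of the packaged plug-ins.

```shell
# Hypothetical plug-in logic: compare an integer measurement ($1) with a
# warning threshold ($2) and a critical threshold ($3), print one status
# line, and return the exit status Nagios expects
# (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
check_threshold() {
    for arg in "$1" "$2" "$3"; do
        case "$arg" in
            ''|*[!0-9]*) echo "UNKNOWN - expected three integers"; return 3 ;;
        esac
    done
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - value is $1"; return 2
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - value is $1"; return 1
    fi
    echo "OK - value is $1"
}

# A measurement of 7 with warning at 5 and critical at 10.
check_threshold 7 5 10 || echo "exit status: $?"
```

In a real plug-in the measurement would come from a local command, and the script would end with exit rather than return so that Nagios sees the status.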